Salient Object Detection: A Benchmark
Abstract: We extensively compare, qualitatively and quantitatively,
40 state-of-the-art models (28 salient object detection, 10 fixation
prediction, 1 objectness, and 1 baseline) over 6 challenging datasets
for the purpose of benchmarking salient object detection and
segmentation methods. From the results obtained so far, our
evaluation shows consistent, rapid progress over the last few years
in terms of both accuracy and running time. The top contenders in
this benchmark significantly outperform the models identified as the
best in the previous benchmark conducted just two years ago. We find
that the models designed specifically for salient object detection
generally work better than models in closely related areas, which in
turn provides a precise definition and suggests an appropriate
treatment of this problem that distinguishes it from other problems.
In particular, we analyze the influences of center bias and scene
complexity on model performance, which, along with the hard cases for
state-of-the-art models, provide useful hints towards constructing
more challenging large-scale datasets and better saliency models.
Finally, we propose probable solutions for tackling several open
problems such as evaluation scores and dataset bias, which also
suggest future research directions in the rapidly growing field of
salient object detection.

Index Terms: Salient object detection, saliency, explicit saliency,
visual attention, regions of interest, objectness, segmentation,
interestingness, importance, eye movements

I. Introduction

VISUAL attention, the astonishing capability of the human visual
system to selectively process only the salient visual stimuli in
detail, has been investigated by multiple disciplines such as
cognitive psychology, neuroscience, and computer vision [2]-[5].
Following cognitive theories (e.g., feature integration theory (FIT)
[6], guided search model [7], [8]) and early attention models (e.g.,
Koch and Ullman [9] and Itti et al. [10]), hundreds of computational
saliency models have been proposed to detect salient visual subsets
from images and videos.

Despite the psychological and neurobiological definitions, the
concept of visual saliency is becoming vague in the field of computer
vision. Some visual saliency models (e.g., [3], [10]-[16]) aimed to
predict human fixations as a way to test their accuracy in saliency
detection, while other models [17]-[19], which were often driven by
computer vision applications such as content-aware image resizing and
photo visualization [20], attempted to identify salient
regions/objects and used explicit saliency judgments for evaluation
[21]. Although both types of saliency models are expected to be
applicable interchangeably, their generated saliency maps actually
demonstrate remarkably different characteristics due to the distinct
purposes in saliency detection. For example, fixation prediction
models usually pop out sparse blob-like salient regions, while
salient object detection models often generate smooth connected
areas. On the one hand, detecting large salient areas often causes
severe false positives for fixation prediction. On the other hand,
popping out only sparse salient regions causes massive misses in
detecting salient regions and objects.

To separate these two types of saliency models, in this study we
provide a precise definition and suggest an appropriate treatment of
salient object detection. Generally, a salient object detection model
should, first, detect the salient attention-grabbing objects in a
scene, and second, segment the entire objects. Usually, the output of
the model is a saliency map where the intensity of each pixel
represents its probability of belonging to salient objects. From this
definition, we can see that this problem in its essence is a
figure/ground segmentation problem, and the goal is to segment only
the salient foreground object from the background. Note that it
slightly differs from the traditional image segmentation problem that
aims to partition an image into perceptually coherent regions.

The value of salient object detection models lies in their
applications in many areas such as computer vision, graphics, and
robotics. For instance, these models have been successfully applied
to object detection and recognition [22]-[29], image and video
compression [30], [31], video summarization [32]-[34], photo
collage/media re-targeting/cropping/thumb-nailing [20], [35], [36],
image quality assessment [37]-[39], image segmentation [40]-[43],
content-based image retrieval and image collection browsing
[44]-[47], image editing and manipulating [48]-[51], visual tracking
[52]-[58], object discovery [59], [60], and human-robot interaction
[61], [62]. The field of salient object detection develops very fast.
Many new models and benchmark datasets have been proposed since our
earlier benchmark conducted two years ago [1]. Yet, it is unclear how
the new algorithms fare against previous models and new datasets. Are
there any real improvements in this field, or are we just fitting
models to datasets? It is also interesting to test the performance of
old high-performing models on the new benchmark datasets. A recent
exhaustive review of salient object detection models can be found in
[28].

In this study, we compare and analyze models from three categories:
1) salient object detection, 2) fixation prediction, and 3) object
proposal generation.¹ The reason to include the latter two types of
models is to conduct across-category comparison and to study whether
models specifically designed for salient object detection show an
actual advantage over models for fixation prediction and object
proposal generation. This is particularly important since these
models have different objectives and generate visually distinctive
maps. We also include a baseline model to study the effect of center
bias in model comparison. In summary, we hope that such a benchmark
not only allows researchers to compare their models with other
algorithms but also helps identify the chief factors affecting the
performance of salient object detection models.

II. Salient Object Detection Benchmark 

In this benchmark, we focus on evaluating models whose input is a
single image. This is because salient object detection on a single
input image is the main research direction, while the comprehensive
evaluation of models working on multiple input images (e.g.,
co-salient object detection) lacks public benchmark datasets.

A. Compared Models 

In this study, we run 40 models in total (28 salient object detection
models, 10 fixation prediction models, 1 objectness proposal model,
and 1 baseline) whose code or executables were accessible (see Fig. 1
for a complete list). The baseline model, denoted as “Average
Annotation Map (AAM),” is simply the average of the ground-truth
annotations of all images in each dataset. Note that AAM often has a
larger activation at the image center (see Fig. 2), and we can thus
study the effect of center bias in model comparison.
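
The AAM baseline described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `average_annotation_map` is hypothetical, and all masks are assumed to have already been resized to a common resolution (the text does not say how images of different sizes are handled). Plain nested lists are used so that the sketch has no dependencies; an array library would be the usual choice in practice.

```python
def average_annotation_map(masks):
    """Average equally sized binary masks (lists of rows of 0/1 values).

    Assumption: every mask has the same height and width.
    """
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    total = [[0.0] * width for _ in range(height)]
    for mask in masks:
        for y in range(height):
            for x in range(width):
                # Count how often each pixel is annotated as salient.
                total[y][x] += 1.0 if mask[y][x] else 0.0
    n = float(len(masks))
    return [[value / n for value in row] for row in total]


# Two toy 2x3 annotations (hypothetical data).
masks = [
    [[0, 1, 0], [0, 1, 0]],
    [[0, 1, 1], [0, 0, 0]],
]
aam = average_annotation_map(masks)
```

Each entry of `aam` is the fraction of images in which that pixel belongs to a salient object, so a dataset with strong center bias yields a map that peaks near the image center.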

B. Datasets 

Since there exist many datasets that differ in number of images,
number of objects per image, image resolution and annotation form
(bounding box or accurate region mask), it is likely that models may
rank differently across datasets. Hence, for a fair comparison, it is
necessary to run models over multiple datasets so as to draw
objective conclusions. A good model should perform well over almost
all datasets. Toward this end, six datasets were chosen for model
comparison: 1) MSRA10K [98], 2) ECSSD [75], 3) THUR15K [98],
4) JuddDB [99], 5) DUT-OMRON [76], and 6) SED2 [1], [100]. These
datasets were selected based on the following four criteria: 1) being
widely used, 2) containing a large number of images, 3) having
different biases (e.g., number of salient objects, image clutter,
center bias), and 4) potential to be used as benchmarks in future
research.

MSRA10K is a descendant of the MSRA dataset [17]. It
contains 10,000 annotated images that cover all the 1,000

¹Object proposal generation is a recently emerging trend which attempts
to detect image regions that may contain objects from any object category
(i.e., category independent object proposals).


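#   Model       Pub         Year  Code   Time(s)  Cat.
1   LC [63]     MM          2006  C      .009     Salient object detection
2   AC [64]     ICVS        2008  C      .129     Salient object detection
3   FT [18]     CVPR        2009  C      .072     Salient object detection
4   CA [65]     CVPR        2010  M + C  40.9     Salient object detection
5   MSS [66]    ICIP        2010  C      .076     Salient object detection
6   SEG [67]    ECCV        2010  M + C  10.9     Salient object detection
7   RC [68]     CVPR        2011  C      .136     Salient object detection
8   HC [68]     CVPR        2011  C      .017     Salient object detection
9   SWD [69]    CVPR        2011  M + C  .190     Salient object detection
10  SVO [70]    ICCV        2011  M + C  56.5     Salient object detection
11  CB [71]     BMVC        2011  M + C  2.24     Salient object detection
12  FES [72]    Img.Anal.   2011  M + C  .096     Salient object detection
13  SF [73]     CVPR        2012  C      .202     Salient object detection
14  LMLC [74]   TIP         2013  M + C  140.     Salient object detection
15  HS [75]     CVPR        2013  EXE    .528     Salient object detection
16  GMR [76]    CVPR        2013  M      .149     Salient object detection
17  DRFI [77]   CVPR        2013  C      .697     Salient object detection
18  PCA [78]    CVPR        2013  M + C  4.34     Salient object detection
19  LBI [79]    CVPR        2013  M + C  251.     Salient object detection
20  GC [80]     ICCV        2013  C      .037     Salient object detection
21  CHM [81]    ICCV        2013  M + C  15.4     Salient object detection
22  DSR [82]    ICCV        2013  M + C  10.2     Salient object detection
23  MC [83]     ICCV        2013  M + C  .195     Salient object detection
24  UFO [84]    ICCV        2013  M + C  20.3     Salient object detection
25  MNP [50]    Vis. Comp.  2013  M + C  21.0     Salient object detection
26  GR [85]     SPL         2013  M + C  1.35     Salient object detection
27  RBD [86]    CVPR        2014  M      .269     Salient object detection
28  HDCT [87]   CVPR        2014  M      4.12     Salient object detection
1   IT [10]     PAMI        1998  M      .302     Fixation prediction
2   AIM [88]    JOV         2006  M      8.66     Fixation prediction
3   GB [89]     NIPS        2007  M + C  .735     Fixation prediction
4   SR [90]     CVPR        2007  M      .040     Fixation prediction
5   SUN [91]    JOV         2008  M      3.56     Fixation prediction
6   SeR [92]    JOV         2009  M      1.31     Fixation prediction
7   SIM [93]    CVPR        2011  M      1.11     Fixation prediction
8   SS [94]     PAMI        2012  M      .053     Fixation prediction
9   COV [95]    JOV         2013  M      25.4     Fixation prediction
10  BMS [96]    ICCV        2013  M + C  .575     Fixation prediction
1   OBJ [97]    CVPR        2010  M + C  3.01     -
1   AAM         -           -     -      -        -
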

Fig. 1. Compared salient object detection, fixation prediction, object
proposal generation, and baseline models sorted by their publication year
(M = Matlab, C = C/C++, EXE = executable). The average running time
is tested on the MSRA10K dataset (typical image resolution 400 x 300) using
a desktop machine with a Xeon E5645 2.4 GHz CPU and 8 GB RAM. We
evaluate those models whose code or executables are available.


images in the popular ASD dataset [18]. THUR15K and DUT-OMRON are
used to compare models on a large scale. ECSSD contains a large
number of semantically meaningful but structurally complex natural
images. The reason to include JuddDB was to assess the performance of
models over scenes with multiple objects and high background clutter.
Finally, we also evaluate models over SED2 to check whether salient
object detection algorithms can perform well on images containing
more than one salient object (i.e., two in SED2). Fig. 2 shows the
AAM model output of the six benchmark datasets to illustrate their
different center biases. See Fig. 3 for representative images and
annotations from each dataset.

We illustrate in Fig. 4 the statistics of the six chosen datasets. In
Fig. 4(a), we show the normalized distances from the centroid of
salient objects to the corresponding image centers. We can see that
salient objects in ECSSD have the shortest distance to image centers,
while salient objects in SED2 have the longest distances. This is
reasonable since

Fig. 2. Average annotation maps of six datasets used in benchmarking:
(a) MSRA10K, (b) ECSSD, (c) THUR15K, (d) DUT-OMRON, (e) JuddDB, (f) SED2.


images in SED2 usually have two objects aligned around opposite image
borders. Moreover, we can see that the spatial distribution of
salient objects in JuddDB has a larger variety than in other
datasets, indicating that this dataset has a smaller positional bias
(i.e., center bias of salient objects and border bias of background
regions).

In Fig. 4(b), we aim to show the complexity of images in the six
benchmark datasets. Toward this end, we apply the segmentation
algorithm by Felzenszwalb et al. [101] to see how many super-pixels
(i.e., homogeneous regions) can be obtained on average from the
salient objects and background regions of each image, respectively.
We use this measure to reflect how challenging a benchmark is, since
a large number of super-pixels often indicates complex foreground
objects and cluttered background. From Fig. 4(b), we can see that
JuddDB is the most challenging benchmark since it has an average of
493 super-pixels in the background of each image. On the contrary,
SED2 contains fewer super-pixels in foreground and background
regions, indicating that images in this benchmark often contain
uniform regions and are easy to process.

In Fig. 4(c), we show the average object sizes of these benchmarks,
where the size of each object is normalized by the size of the
corresponding image. We can see that the MSRA10K and ECSSD datasets
have larger objects while SED2 has smaller ones. In particular, we
can see that some benchmarks contain a limited number of image
regions with large foreground objects. When the center-bias property
is also taken into account, it becomes very easy to achieve a high
precision on these images.

C. Evaluation Measures 

There are several ways to measure the agreement between model
predictions and human annotations [21]. Some metrics evaluate the
overlap between tagged regions, while others assess how accurately
drawn shapes match the object boundary. In addition, some metrics
have tried to consider both boundary and shape [102].

Here, we use three universally agreed, standard, and
easy-to-understand measures for evaluating a salient object detection
model. The first two evaluation metrics are based on the overlapping
area between subjective annotation and




Fig. 3. Images and pixel-level annotations from six salient object
datasets, including (c) JuddDB, (d) DUT-OMRON, (e) THUR15K, and (f) SED2.


saliency prediction, including the precision-recall (PR) and the
receiver operating characteristics (ROC). From these two metrics, we
also report the F-Measure, which jointly considers recall and
precision, and AUC, which is the area under the ROC curve. Moreover,
we also use a third measure which directly computes the mean absolute
error (MAE) between the estimated saliency map and the ground-truth
annotation. For the sake of simplification, we use S to represent the
predicted saliency map normalized to [0, 255] and G to represent the
ground-truth binary mask of salient objects. For a binary mask, we
use |·| to represent the number of non-zero entries in the mask.


Precision-recall (PR). For a saliency map S, we can 
convert it to a binary mask M and compute Precision 
and Recall by comparing M with ground-truth G: 


$$\mathrm{Precision} = \frac{|M \cap G|}{|M|}, \qquad \mathrm{Recall} = \frac{|M \cap G|}{|G|} \tag{1}$$


From this definition, we can see that the binarization of S is the
key step in the evaluation. Usually, there are three popular ways to
perform the binarization. In the first solution, Achanta et al. [18]
proposed an image-dependent adaptive threshold for binarizing S,
which is computed as twice the mean saliency of S:


$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y) \tag{2}$$


where W and H are the width and the height of the saliency 
map S, respectively. 

Fig. 4. Statistics of the benchmark datasets: a) distribution of
normalized object distance from image center, b) distribution of
super-pixel number on salient objects and image background, and
c) distribution of normalized object size.

The second way to binarize S is to use a fixed threshold which
changes from 0 to 255. At each threshold, a pair of precision/recall
scores is computed, and the pairs are finally combined to form a
precision-recall (PR) curve that describes the model performance at
different operating points.
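
The first two binarization strategies can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names are hypothetical, pixels with saliency greater than or equal to the threshold are treated as foreground (an assumption), and precision is reported as 0.0 when the binary mask is empty (a convention chosen only for this sketch).

```python
def binarize(saliency, threshold):
    """Binary mask M: 1 where the saliency value reaches the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def adaptive_threshold(saliency):
    """Image-dependent threshold: twice the mean saliency, as in Eq. (2)."""
    values = [v for row in saliency for v in row]
    return 2.0 * sum(values) / len(values)


def precision_recall(mask, gt):
    """Precision and recall of mask M against ground truth G, as in Eq. (1)."""
    overlap = sum(1 for mr, gr in zip(mask, gt) for m, g in zip(mr, gr) if m and g)
    mask_count = sum(1 for row in mask for m in row if m)
    gt_count = sum(1 for row in gt for g in row if g)
    precision = overlap / mask_count if mask_count else 0.0  # sketch convention
    recall = overlap / gt_count if gt_count else 0.0
    return precision, recall


def pr_curve(saliency, gt):
    """One (precision, recall) pair per fixed threshold 0..255."""
    return [precision_recall(binarize(saliency, t), gt) for t in range(256)]


# Toy 2x2 saliency map in [0, 255] and its ground truth (hypothetical data).
saliency = [[0, 50], [200, 255]]
gt = [[0, 0], [1, 1]]
t_a = adaptive_threshold(saliency)
adaptive_pr = precision_recall(binarize(saliency, t_a), gt)
curve = pr_curve(saliency, gt)
```

Note that the adaptive threshold exceeds 255 whenever the mean saliency is above 127.5, in which case this sketch produces an empty mask.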

The third way of binarization is to use the SaliencyCut algorithm
[68]. In this solution, a loose threshold, which typically results in
good recall but relatively poor precision, is used to generate the
initial binary mask. Then the method iteratively uses the GrabCut
segmentation method [103] to gradually refine the binary mask. The
final binary mask is used to re-compute the precision-recall value.

F-measure. Usually, neither Precision nor Recall can comprehensively
evaluate the quality of a saliency map. To this end, the F-measure is
proposed as a weighted harmonic mean of them with a non-negative
weight β²:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$

As suggested by many salient object detection works (e.g., [18],
[68], [73]), β² is set to 0.3 to give more importance to the
Precision value. The reason for weighting precision more than recall
is that the recall rate is not as important as precision (see also
[104]). For instance, 100% recall can be easily achieved by setting
the whole region to foreground.

Depending on how the saliency map is binarized, there are two ways to
compute the F-Measure. When the adaptive threshold or GrabCut
algorithm is used for the binarization, we can generate a single F_β
for each image, and the final F-Measure is computed as the average
F_β. When using fixed thresholding, the resulting PR curve can be
scored by its maximal F_β, which is a good summary of the detection
performance (as suggested in [105]). As defined in (3), the F-Measure
is the weighted harmonic mean of precision and recall, and thus
shares the same value bounds as precision and recall, i.e., [0, 1].
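
The two ways of scoring with the F-measure can be sketched as below. The function names are hypothetical, and returning 0.0 when both precision and recall are zero is a convention chosen for this sketch; the default weight β² = 0.3 is the value stated in the text.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall, as in Eq. (3)."""
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # sketch convention when precision = recall = 0
    return (1.0 + beta_sq) * precision * recall / denominator


def max_f_measure(pr_pairs, beta_sq=0.3):
    """Score a PR curve (fixed thresholds) by its maximal F-measure."""
    return max(f_measure(p, r, beta_sq) for p, r in pr_pairs)


def mean_f_measure(pr_pairs, beta_sq=0.3):
    """Average per-image F-measure (adaptive threshold or SaliencyCut)."""
    scores = [f_measure(p, r, beta_sq) for p, r in pr_pairs]
    return sum(scores) / len(scores)


# Hypothetical (precision, recall) pairs.
pairs = [(0.5, 1.0), (1.0, 0.5)]
best = max_f_measure(pairs)
average = mean_f_measure(pairs)
```

With β² = 0.3, the pair (1.0, 0.5) scores higher than (0.5, 1.0), which illustrates how this weighting favors precision over recall.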

Fig. 5. PR and ROC curves for BMS [96] and GB [89] over ECSSD.

Receiver operating characteristics (ROC) curve. In addition to the
Precision, Recall and F_β, we can also report the false positive rate
(FPR) and true positive rate (TPR) when binarizing the saliency map
with a set of fixed thresholds:

$$\mathrm{TPR} = \frac{|M \cap G|}{|G|}, \qquad \mathrm{FPR} = \frac{|M \cap \bar{G}|}{|M \cap \bar{G}| + |\bar{M} \cap \bar{G}|} \tag{4}$$



where \bar{M} and \bar{G} denote the complements of the binary mask M
and the ground truth G, respectively. The ROC curve is the plot of
TPR versus FPR obtained by varying the fixed threshold T_f.

Area under ROC curve (AUC) score. While ROC is a 
two-dimensional representation of a model’s performance, 
the AUC distills this information into a single scalar. As 
the name implies, it is calculated as the area under the 
ROC curve. A perfect model will score an AUC of 1, while 
random guessing will score an AUC around 0.5. 
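
A minimal sketch of the ROC curve and its AUC is given below. It is an illustration only: the function names are hypothetical, pixels at or above the threshold count as foreground (an assumption), and the area is approximated with the trapezoidal rule after adding the end points (0, 0) and (1, 1).

```python
def roc_points(saliency, gt, thresholds=range(256)):
    """(FPR, TPR) pairs obtained by sweeping a fixed threshold."""
    flat = [(s, 1 if g else 0)
            for srow, grow in zip(saliency, gt)
            for s, g in zip(srow, grow)]
    positives = sum(g for _, g in flat)
    negatives = len(flat) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ground truth needs both salient and non-salient pixels")
    points = []
    for t in thresholds:
        tp = sum(1 for s, g in flat if s >= t and g)
        fp = sum(1 for s, g in flat if s >= t and not g)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical 2x2 example: one perfect and one fully inverted saliency map.
gt = [[0, 0], [1, 1]]
perfect_auc = auc(roc_points([[0, 50], [200, 255]], gt))
inverted_auc = auc(roc_points([[255, 200], [50, 0]], gt))
```

In this toy case the map that ranks every salient pixel above every non-salient pixel reaches an AUC of 1, and the inverted map reaches 0.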


Fig. 6. Precision (vertical axis) and recall (horizontal axis) curves
of saliency methods on 6 popular benchmark datasets.

Mean absolute error (MAE) score. The overlap-based evaluation
measures introduced above do not consider the true negative saliency
assignments, i.e., the pixels correctly marked as non-salient. This
favors methods that successfully assign saliency to salient pixels
but fail to detect non-salient regions over methods that successfully
detect non-salient pixels but make mistakes in determining the
salient ones [73], [80]. Moreover, in some application scenarios
[106] the quality of the weighted, continuous saliency maps may be of
higher importance than the binary masks. For a more comprehensive
comparison we therefore also evaluate the mean absolute error (MAE)
between the continuous saliency map S and the binary ground truth G,
both normalized to the range [0, 1]. The MAE score is defined as:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)| \tag{5}$$
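
The MAE score can be sketched directly from its definition. The function name and the `max_value` parameter are hypothetical; the sketch assumes the saliency map is stored in [0, 255] and rescales it to [0, 1] before comparison.

```python
def mean_absolute_error(saliency, gt, max_value=255.0):
    """Mean absolute difference between a saliency map and a binary mask.

    The saliency map is divided by max_value so that both inputs lie in [0, 1].
    """
    total = 0.0
    count = 0
    for srow, grow in zip(saliency, gt):
        for s, g in zip(srow, grow):
            total += abs(s / max_value - (1.0 if g else 0.0))
            count += 1
    return total / count


# Hypothetical 2x2 examples.
gt = [[0, 0], [1, 1]]
mae_perfect = mean_absolute_error([[0, 0], [255, 255]], gt)
mae_all_salient = mean_absolute_error([[255, 255], [255, 255]], gt)
```

Unlike the overlap-based scores, this measure penalizes the second map for every background pixel it marks as salient, even though its recall is perfect.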

Note that these scores sometimes do not agree with each other. For
example, Fig. 5 shows a comparison of two models over ECSSD using PR
and ROC metrics. While there is not a big difference in the ROC
curves (thus about the same AUC), one model clearly scores better on
the PR curve (thus having a higher F_β). Such disparity between the
ROC and PR measures has been extensively studied in [107]. Note that
the number of negative examples (non-salient pixels) is typically
much bigger than the number of positive examples (salient object
pixels) in evaluating salient object detection models. Therefore, PR
curves are more informative than ROC curves, which can present an
overly optimistic view of an algorithm’s performance [107]. Thus we
mainly base our conclusions on the PR curve scores (i.e., F-Measure
scores), and also report other scores for comprehensive comparisons
and for facilitating specific application requirements. It is worth
mentioning that active research is ongoing to find better ways of
measuring salient object detection and segmentation models (e.g.,
[108]).
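
The effect of class imbalance on the two curves can be illustrated with a small arithmetic sketch. All counts below are hypothetical and chosen only for illustration: an image with 1,000 salient and 9,000 non-salient pixels, and two models that find the same number of salient pixels but differ in their false positives.

```python
def rates(tp, fp, positives, negatives):
    """Return (TPR, FPR, precision) for one binarized saliency map."""
    tpr = tp / positives
    fpr = fp / negatives
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Hypothetical pixel counts: negatives outnumber positives nine to one.
positives, negatives = 1000, 9000
model_a = rates(tp=800, fp=200, positives=positives, negatives=negatives)
model_b = rates(tp=800, fp=900, positives=positives, negatives=negatives)

fpr_gap = model_b[1] - model_a[1]        # difference seen on the ROC curve
precision_gap = model_a[2] - model_b[2]  # difference seen on the PR curve
```

Here the FPR of the two models differs by less than 0.08, while their precision differs by about 0.33, so the same 700 extra false positives barely move the ROC point but clearly lower the PR point.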


D. Quantitative Comparison of Models 

We evaluate saliency maps produced by different models 
on six datasets by using all evaluation metrics: 

1) Fig. 6 and Fig. 7 show PR and ROC curves; 

2) Fig. 8 and Fig. 9 demonstrate AUC and MAE scores; 

3) Fig. 10 shows the F_β scores of all models².

In terms of both PR and ROC curves, the DRFI model surprisingly
outperforms all other models on the six benchmark datasets by large
margins. Besides, RBD, DSR and MC (solid lines with blue, yellow, and
magenta colors, respectively) achieve close performance and perform
slightly better than other models.

Using the F-measure (i.e., F_β), the five best models are: DRFI, MC,
RBD, DSR, and GMR, where the DRFI model consistently wins over all
the 5 datasets. MC ranks the second best over 2 datasets and the
third best over 2 datasets. SR and SIM models perform the worst.

With respect to the AUC score, DRFI again ranks the best over all six
datasets. Following DRFI, the DSR model ranks second over 4 datasets.
RBD ranks second on 1 dataset and third on 2 datasets. While PCA
ranks third on 1 dataset in terms of AUC score, it is not on the list
of top three contenders using the F_β measure. IT, LC, and SR achieve
the worst performance. It is worth mentioning that all the models
perform well above chance level (AUC = 0.5) on the six benchmark
datasets.

²Three segmentation methods are used, including adaptive threshold,
fixed threshold, and SaliencyCut algorithm. The influence of segmentation
methods will be discussed in Sect. III-A.

Fig. 7. ROC curves of models on 6 benchmarks. False and true positive rates are shown in x and y axes, respectively.


Rankings of models using MAE are more diverse than those using either
F_β or AUC scores. DSR, RBD and DRFI rank at the top, but none of
them is among the top three models over JuddDB. MC, which performs
well in terms of F_β and AUC, is not included in the top three models
on any dataset. PCA performs the best on JuddDB but worse on the
others. SIM and SVO models perform the worst.

On average, the compared fixation prediction and object proposal
generation models perform worse than salient object detection models.
As two outliers, COV and BMS outperform several salient object
detection models in terms of all evaluation metrics, implying that
they are suitable for detecting salient proto-objects. Additionally,
Fig. 11 shows the distribution of F_β, ROC and MAE scores of all
salient object detection models versus all fixation prediction models
over all benchmark datasets. We can see a sharp separation of models,
especially for the F_β score, where most of the top models are
salient object detection models. This result is consistent with the
conclusion in [1] that fixation prediction models perform worse than
salient object detection models. Though stemming from fixation
prediction, research in salient object detection has its own unique
properties and has truly added to what traditional saliency models
focusing on fixation prediction already offer.

In particular, most of the 28 salient object detection models
outperform the baseline AAM model. Among these 28 models, AAM only
outperforms 2 models over MSRA10K, 8 over ECSSD, 4 on THUR15K, 12 on
JuddDB, and 4 on DUT-OMRON in terms of F_β. Interestingly, the AAM
model does not outperform any model over SED2, which means that
indeed there is less center bias in this dataset and salient object
detection models can detect off-center objects. Notice that AAM ranks
lowest on SED2 compared to other datasets. Please note that this does
not necessarily mean that models below AAM are not good, as taking
advantage of the location prior may further enhance their performance
(e.g., LC and FT).

On average, over all models and scores, the performances were lower
on JuddDB, DUT-OMRON and THUR15K, implying that these datasets were
more challenging. The low model performance on JuddDB can be caused
by both less center bias and small objects in images. Noisy labeling
of the DUT-OMRON dataset might also be a reason for low model
performance. By investigating some images of these two datasets for
which models performed poorly, we found that there are several
objects that can potentially be the most salient one. This makes the
generation of ground truth quite subjective and challenging, although
the most salient object in JuddDB has objectively been defined to be
the most looked-at one measured from eye movement data.

E. Qualitative Comparison of Models 

Fig. 12 shows output maps of all models for a sample image with a
relatively complex background. Dark blue areas are less salient while
dark red indicates higher saliency values. Compared with other
models, top contenders like DRFI and DSR suppress most of the
background well while almost successfully detecting the whole salient
object. They thus generate higher precision scores and lower false
Model  THUR15K  JuddDB  DUT-OMRON  SED2  MSRA10K  ECSSD
HDCT   .878     .771    .869       .898  .941     .866
RBD    .887     .826    .894       .899  .955     .894
GR     .829     .747    .846       .854  .925     .831
MNP    .854     .768    .835       .888  .895     .820
UFO    .853     .775    .839       .845  .938     .875
MC     .895     .823    .887       .877  .951     .910
DSR    .902     .826    .899       .915  .959     .914
CHM    .910     .797    .890       .831  .952     .903
GC     .803     .702    .796       .846  .912     .805
LBI    .876     .792    .854       .896  .910     .842
PCA    .885     .804    .887       .911  .941     .876
DRFI   .938     .851    .933       .944  .978     .944
GMR    .856     .781    .853       .862  .944     .889
HS     .853     .775    .860       .858  .933     .883
LMLC   .853     .724    .817       .826  .936     .849
SF     .799     .711    .803       .871  .905

Fig. 8. AUC: area under ROC curve (Higher is better. The top
three models are highlighted in red, green and blue).


positive rates. Some models that include a center-bias component also
result in appealing maps, e.g., CB. Interestingly, region-based
approaches, e.g., RC, HS, DRFI, GMR, CB, and DSR, always preserve the
object boundary well compared with other pixel-based or patch-based
models.

We can also clearly see the distinctness of different categories of
models. Salient object detection models try to highlight the whole
salient object and suppress the background. Fixation prediction
models often produce blob-like and sparse saliency maps corresponding
to the fixation areas of humans on scenes. The objectness map is a
rough indication of the salient object. The output of the latter two
types of models might not be suitable for segmenting the whole
salient object well.

Fig. 9. MAE: Mean Absolute Error (Smaller is better. The top
three models are highlighted in red, green and blue).


III. Performance Analysis

Based on the performances reported above, we also 
conduct several experiments to provide a detailed analysis 
of all the benchmarking models and datasets. 

A. Analysis of Segmentation Methods 

In many computer vision and graphics applications, segmenting regions
of interest is of great practical importance [36], [44], [47]-[49],
[109], [110]. The simplest way of segmenting a salient object is to
binarize the saliency map using a fixed threshold, which might be
hard to choose. In this section, we extensively evaluate two
additional, commonly used salient object segmentation methods:
adaptive threshold [18] and SaliencyCut [68].

Fig. 10. F_β statistics on each dataset, using varying fixed thresholds, adaptive threshold, and SaliencyCut (Higher is better. The top three models are
highlighted in red, green and blue).


Average F_β scores for salient object segmentation results on six
benchmark datasets are shown in Fig. 10. Each segmentation algorithm
was fed with saliency maps produced by all 40 compared models.

Except on the JuddDB and SED2 datasets, the best segmentation results
are all achieved via the SaliencyCut method combined with a
sophisticated salient object detection model (e.g., DRFI, RBD, MNP).
This suggests that enforcing label consistency by using graph-based
segmentation and global appearance statistics benefits salient object
segmentation. The default SaliencyCut [68] program only outputs the
most dominant salient object. This causes results for the SED2 and
JuddDB benchmarks to be less optimal, as images in these two datasets
(see Fig. 3) do not follow the “single none ambiguous salient object
assumption” made in [68].

As also observed by most works in the image segmentation literature,
nearby pixels with similar appearance tend to have similar object
labels. To validate this, we show in Fig. 13(a) some better
segmentation results obtained by further enforcing label consistency
among nearby and similar pixels. Enforcing such label consistency
often helps improve pixel labeling, especially when the majority of
the salient object pixels have been highlighted in the detection
phase. Challenging examples might still exist, however, such as
complex object topology, spindle components, and similar appearance
with respect to the image background. More results of using the best
combination, DRFI saliency maps and SaliencyCut segmentation, are
shown for images with various complexities in Fig. 13(b).

Fig. 11. Histogram of AUC, MAE, and mean F_β scores for salient object detection models (blue) versus fixation prediction models (red) collapsed
over all six datasets.

Fig. 12. Estimated saliency maps from various salient object detection
models, object proposal generation model, average annotation map, and
fixation prediction models.

A failure case of SaliencyCut segmentation along with intermediate
results is also shown in the last row of Fig. 13(a). Due to the
complex topology of the salient object, label consistency in a local
range considered in the SaliencyCut algorithm may not work well.
Additionally, the appearance of the object looks very distinct due to
the existence of shading and reflection, which makes the segmentation
of the whole object very challenging. Therefore, only a part of the
object is finally segmented.

Fig. 13. Samples of salient object segmentation results. Panel (b):
DRFI model output fed to the SaliencyCut algorithm.



B. Analysis of Center Bias 

In this section, we study the center-bias challenge, since
it has caused a major problem for fixation prediction and
salient object detection models. Some studies add
a Gaussian center prior to models when comparing them.
This might not be fair, as several salient object detection
models already contain center bias at different levels.
Instead, we randomly choose 1000 images with little or no
center bias from the MSRA10K dataset. First, the distance
from the salient object centroid to the image center is computed
for each image. Images for which this distance
exceeds a threshold are then chosen. Some sample
images with little or no center bias, as well as an illustration
of the selection threshold, are shown in Fig. 14.
The average annotation of the less center-biased images shows
two peaks on the left and on the right of the image, which
makes this subset suitable for testing the performance of salient object
detection models on off-center images.
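The selection step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, and the choice to normalize coordinates to [0, 1] per axis is an assumption (the text does not state how the distance is normalized). The default threshold of 0.247 is the value quoted in the caption of Fig. 14.

```python
import numpy as np


def centroid_offset(mask):
    """Distance from the object centroid to the image center.

    Coordinates are normalized to [0, 1] per axis (an assumption), so the
    result does not depend on image resolution. Returns None for an empty mask.
    """
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    # Pixel centers sit at index + 0.5.
    cy = (rows.mean() + 0.5) / height
    cx = (cols.mean() + 0.5) / width
    return float(np.hypot(cx - 0.5, cy - 0.5))


def select_off_center(masks, threshold=0.247):
    """Indices of ground-truth masks whose centroid offset exceeds the threshold."""
    selected = []
    for index, mask in enumerate(masks):
        offset = centroid_offset(mask)
        if offset is not None and offset > threshold:
            selected.append(index)
    return selected
```

Under these assumptions, the 1000-image subset would be a random sample drawn from the indices returned by `select_off_center`.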

We evaluate all 40 compared models on these 1000
images. PR and ROC curves, as well as F_β, AUC, and MAE scores,
Fig. 15. Results of center-bias analysis over 1000 less center-biased images chosen from the MSRA10K dataset. Top: ROC and PR curves. Bottom:
mean F_β, AUC, and MAE scores for all models.





Fig. 14. Left: Histogram of object center over all images, threshold (red
line = 0.247), and annotation map over 1000 less center-biased images in
the MSRA10K dataset. Right: Four less center-biased images. The overlaid
circle illustrates the center-bias threshold.


are all shown in Fig. 15. DRFI and DSR again perform
the best. Overall, the performance of most models decreases when
tested on images with little or no center bias (e.g., the AUC score
of MC declines from 0.951 to 0.888), while a few others
improve. For example, the AUC score of SVO rises
from 0.930 to 0.942, which gives it the second ranking. Some
models, e.g., HS (ranked second in terms of the
F_β score), perform better according to their rank changes w.r.t.
the whole MSRA10K dataset. DRFI still wins over other
models here by a large margin. The differences in its F_β,
AUC, and MAE scores between the full dataset and the
1000 less center-biased images are small (0.05, 0.05, and
0.009, respectively). This means that
this model does not take much advantage of center bias. In
contrast, the CB model relies heavily on a location prior,
which is why its performance drops sharply when applied
to these images (the differences are 0.122, 0.122, and 0.029,
respectively).

Additionally, it can be observed from Fig. 2(f) that there is
less center bias in the SED2 dataset, whose average annotation
map shows less activation in the center. We
can therefore also study center bias on it. As before, DRFI
and DSR outperform other models in terms of F_β, AUC,
and MAE scores, indicating that they are more robust to
variations in the location of salient objects. HS again ranks second
according to the F_β score. Fig. 16 shows the best and worst
un-centered stimuli for the DRFI and DSR models.

Overall, all the models perform well above chance
level on both the less center-biased subset of MSRA10K
and SED2. It is also worth noting that the AAM model
performs significantly worse on these two datasets, as well
as on JuddDB, validating our motivation for studying center
bias on them.

C. Analysis of Salient Object Existence 

Almost all existing salient object detection models
assume that there is at least one salient object in the
input image. This impractical assumption might lead to
suboptimal performance on “background images”, which
do not contain any dominant salient objects, as studied
in [111]. To verify the effectiveness of models on background
images, we collected 800 images from the web and
evaluated the compared models on them.

Fig. 16. Top and Bottom rows for each model illustrate best and worst
cases in un-centered images.

Fig. 17. Sample background-only images and prediction maps of DRFI,
DSR, MC, and IT models.

We can see from Fig. 17 that there are no dominant
salient objects in background images, which consist only of
textures or cluttered background. A good model should
generate a dark saliency map, i.e., one without any activation,
as there are no salient objects. For quantitative evaluation,
we only report the MAE score of each model, which
in this setting essentially amounts to the sum of the non-zero elements of the output
saliency map. Note that it is not feasible to calculate PR and
ROC curves here, since the set of ground-truth positive labels
is empty. Also note that ground-truth eye fixations
do exist on such background images.
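To make the MAE argument concrete, the sketch below computes MAE against an all-zero ground truth and shows that it reduces to the sum of the saliency values divided by the number of pixels. It assumes the usual definition of MAE as the mean absolute difference between a saliency map scaled to [0, 1] and a binary ground-truth mask; the function names are illustrative only.

```python
import numpy as np


def mae(saliency, ground_truth):
    """Mean absolute error between a saliency map and a ground-truth mask."""
    s = np.asarray(saliency, dtype=np.float64)
    g = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(np.abs(s - g)))


def background_mae(saliency):
    """MAE for a background-only image, whose ground truth is all zeros."""
    s = np.asarray(saliency, dtype=np.float64)
    return mae(s, np.zeros_like(s))


# A 4x4 map with two activations: MAE equals their sum over the pixel count.
example = np.zeros((4, 4))
example[0, 0] = 0.8
example[1, 1] = 0.4
score = background_mae(example)  # (0.8 + 0.4) / 16
```

A completely dark map scores 0, so sparse maps are favored by construction in this test.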

Fig. 17 shows some sample background images and
the corresponding saliency maps produced by three top salient object
detection models and a classical fixation prediction model.
Fig. 18 reports MAE scores of 35 models. Top salient object
detection models like DRFI, DSR, and MC do not perform
well and often generate activations on the background
images even though only regular textures exist (the third
Fig. 18. MAE over background-only images with no salient objects.
Shaded area belongs to fixation prediction models.



and fourth rows of Fig. 17). This is understandable, as they
always assume that salient objects exist in the input image
and will try their best to find some. Moreover,
they can be distracted by clutter in the background, since
high contrast always exists in cluttered regions. Most
existing salient object detection models compute saliency based
on contrast values. These cluttered regions are thus more
likely to be considered salient.

From Fig. 18, we can see that the fixation prediction models
COV and IT perform the best on background images in
terms of MAE scores. Compared with the maps with dense
salient regions produced by salient object detection models, maps
generated by fixation prediction models often contain sparse
activations. See Fig. 17 for the output of the IT model. The
sum of the non-zero elements of such sparse saliency maps is
smaller, and thus the performance of COV and IT is better.

D. Analysis of Worst and Best Cases for Top Models 

To understand the challenges facing existing salient
object detection models, we illustrate the three best and
worst cases for top models over all six benchmark datasets.
The stimuli for 11 top models were sorted according to their
F_β scores. We only show the DRFI and MC
models in Fig. 19 due to limited space. See our online
challenge website for additional illustrations.
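The sorting step can be sketched as below. This is an illustration under stated assumptions rather than the benchmark's implementation: it takes maps that have already been binarized (the binarization rule is not specified here), and the default weight `beta2=0.3` is an assumed, commonly used setting that favors precision. Both function names are hypothetical.

```python
import numpy as np


def f_beta(prediction, ground_truth, beta2=0.3):
    """F-measure of a binarized saliency map against a binary ground truth.

    beta2 is the squared beta weight; 0.3 is an assumed default.
    """
    pred = np.asarray(prediction, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    true_pos = np.logical_and(pred, gt).sum()
    if true_pos == 0:
        # Covers empty predictions, empty ground truth, and no overlap.
        return 0.0
    precision = true_pos / pred.sum()
    recall = true_pos / gt.sum()
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))


def best_and_worst(scores, k=3):
    """Indices of the k highest-scoring and k lowest-scoring stimuli."""
    order = np.argsort(np.asarray(scores, dtype=np.float64), kind="stable")
    worst = order[:k].tolist()
    best = order[::-1][:k].tolist()
    return best, worst
```

Given one F_β score per image for a model, `best_and_worst(scores, k=3)` returns the indices of the three easiest and three hardest stimuli for that model.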

It can be seen from Fig. 19 that the models share the
same easy and difficult stimuli. Both DRFI and MC perform
very well in cases where a dominant salient
object exists against a relatively clean background. Since most
existing salient object detection models do not utilize any
high-level prior knowledge, they may fail when a complex
scene has a cluttered background or when the salient object
is semantically salient (e.g., DRFI fails on images with
faces in MSRA10K). Another cause of poor saliency
detection is object size. Both the DRFI and MC models have
difficulty detecting small objects (see hard cases on
DUT-OMRON and JuddDB).

In particular, since the saliency cues adopted by DRFI are
mainly based on contrast, this model fails on scenes where
salient objects are close in appearance to the background
(e.g., the hard cases of MSRA10K and ECSSD). Another
possible reason is a failure in segmenting the
image. MC relies on the pseudo-background prior that the
image border areas are background. That is why it fails on
scenes where the salient object touches the image border,
e.g., the gorilla image in the MSRA10K dataset (4th row of
the right column of Fig. 19).

E. Runtime Analysis 

The runtimes of the 40 compared models are shown in Fig. 1 over
all 10K images of MSRA10K (typical image resolution of
400 × 300) using an Intel Xeon E5645 2.40GHz CPU with
8 GB RAM. The LC model is the fastest (about 0.009
seconds per image), followed by the HC and GC models. The
best model in our benchmark (DRFI) needs about 0.697
seconds to process one image.

IV. Discussions and Conclusions 

From the results obtained so far, we summarize in Fig. 
20 the rankings of models based on average performance 
over all 6 datasets in terms of segmentation methods, center 
bias, salient object existence, and run time^. Based on the 
rankings, we conclude that: 

''DRFI, RBD, DSR, MC, HDCT, and HS are the top 6
models for salient object detection''

By investigating the performance and the design choices
of all compared models, our extensive evaluations
suggest some clear messages about commonly used design
choices, which could be valuable for developing future
algorithms. We refer readers to our recent survey [28] for a
comprehensive review of different design choices adopted
for salient object detection.

• From the elements perspective, all top six models are
built upon superpixels (regions). On the one hand,
compared with pixels, more effective features (e.g.,
color histogram) can be extracted from regions. On
the other hand, compared with patches, the boundary
of the salient object is better preserved by region-based
approaches, leading to more accurate detection
performance. Moreover, since the number of regions is
far smaller than the number of pixels or patches, region-based
methods have the potential to run faster.

• All the top six models explicitly consider the background
prior, which assumes that the area in the
narrow border of the image belongs to the background.
Compared with the location prior of a salient object,
such a background prior is more robust.

• The leading method in our benchmark (i.e., DRFI)
discriminatively trains a regression model to predict
region saliency according to a 93-dimensional feature
vector. Instead of purely relying on the cues extracted
only from the input image, DRFI resorts to human annotations
to automatically discover feature integration
rules. The high performance of this simple learning-based
method encourages pursuing data-driven approaches
for salient object detection.
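The background prior mentioned in the second point can be illustrated with a toy sketch. It is not the method of any benchmarked model: a regular grid of cells stands in for superpixels, the mean color of each cell is its only feature, and the saliency of a cell is its distance to the most similar border cell. The function name and the `grid` parameter are hypothetical.

```python
import numpy as np


def border_prior_saliency(image, grid=8):
    """Toy background-prior saliency on a grid of cells.

    Cells on the image border are treated as background; each cell is
    scored by the color distance to its most similar border cell.
    Assumes the image is at least `grid` pixels along each side.
    """
    img = np.asarray(image, dtype=np.float64)
    height, width = img.shape[:2]
    img = img.reshape(height, width, -1)  # grayscale becomes one channel
    row_edges = np.linspace(0, height, grid + 1).astype(int)
    col_edges = np.linspace(0, width, grid + 1).astype(int)

    # Mean color per cell (the stand-in for a region feature).
    feats = np.zeros((grid, grid, img.shape[2]))
    for i in range(grid):
        for j in range(grid):
            cell = img[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            feats[i, j] = cell.mean(axis=(0, 1))

    border = np.zeros((grid, grid), dtype=bool)
    border[0, :] = border[-1, :] = True
    border[:, 0] = border[:, -1] = True
    background = feats[border]  # shape: (number of border cells, channels)

    diffs = feats[:, :, None, :] - background[None, None, :, :]
    saliency = np.linalg.norm(diffs, axis=-1).min(axis=-1)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency
```

In this sketch, an object in the interior of a uniform image receives the maximum score, while an object touching the border is absorbed into the background set and receives none, which mirrors the border-touching failure mode noted for MC in Sec. III-D.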

^We have created a unified repository for sharing code and data where
researchers can run models with a single click or can add new models for
benchmarking purposes. All code, data, and results are available on our
online benchmark website: http://mmcheng.net/salobjbenchmark/


Even considering top performing models, salient object
detection still seems far from being solved. To achieve more
appealing results, three challenges should be addressed.
First, in our large-scale benchmark (see Sec. II), all top
performing algorithms use location prior cues, limiting
their adaptation to general cases. Second, although the
rankings of top scoring models are quite consistent across
datasets, performance scores (F_β and AUC) drop significantly
from easier datasets to more difficult ones. The third
challenge concerns the run time of models. Some models
need around one minute or more to process a 400 × 300 image (e.g.,
CA: 40.9s, SVO: 56.5s, and LMLC: 140s).

One area for future research would be designing scores
for tackling dataset biases and evaluating saliency segmentation
maps with respect to ground-truth annotations,
similar to [108]. In this benchmark, we only focused
on single-input scenarios. Although some RGBD datasets
exist [112], benchmark datasets for multiple input images
(e.g., salient object detection on videos, co-salient object
detection [28]) are still lacking. Another future direction
will be following active segmentation algorithms (e.g., [99],
[113], [114]) by segmenting a salient object from a seed
point. Finally, aggregation of saliency models for building
a strong prediction model (similar to [1], [115]) and
behavioral investigation of saliency judgments by humans
(e.g., [21], [116]) are two other interesting directions.
(a) DRFI (b) MC

Fig. 19. Best (1st rows for each model on a dataset) and worst (2nd rows) cases of DRFI and MC. Ground-truth object(s) is denoted by a red contour. 